Many pipelined processors use four to six stages. Others divide instruction execution
into smaller steps and use more pipeline stages and a faster clock. For example, the
UltraSPARC II uses a 9-stage pipeline and Intel's Pentium Pro uses a 12-stage pipeline.
The latest Intel processor, Pentium 4, has a 20-stage pipeline and uses a clock speed in
the range 1.3 to 1.5 GHz. For fast operations, there are two pipeline stages in one clock
cycle.

Two important features have been introduced in this chapter, pipelining and multiple
issue. Pipelining enables us to build processors with instruction throughput approaching
one instruction per clock cycle. Multiple issue makes possible superscalar operation,
with instruction throughput of several instructions per clock cycle.
The potential gain in performance can only be realized by careful attention to three
aspects:

The instruction set of the processor
The design of the pipeline hardware
The design of the associated compiler

It is important to appreciate that there are strong interactions among all three. High
performance is critically dependent on the extent to which these interactions are taken
into account in the design of a processor. Instruction sets that are particularly well-suited
for pipelined execution are key features of modern processors.
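
The throughput claims in this summary can be checked with a small calculation. The sketch below is not from the text: it assumes an ideal pipeline with no stalls, in which the first instruction group needs one cycle per stage and every later group completes one cycle after the previous one. The function names and the `issue_width` parameter are hypothetical, chosen only for illustration.

```python
import math


def pipeline_cycles(n_instructions: int, n_stages: int, issue_width: int = 1) -> int:
    """Clock cycles needed to complete n_instructions on an idealized pipeline.

    Assumptions (not from the text): no hazards or cache misses, and up to
    issue_width instructions enter the pipeline together in each cycle.
    """
    if n_instructions <= 0:
        return 0
    # Number of groups that enter the pipeline, one group per clock cycle.
    groups = math.ceil(n_instructions / issue_width)
    # The first group takes n_stages cycles; each later group adds one cycle.
    return n_stages + groups - 1


def throughput(n_instructions: int, n_stages: int, issue_width: int = 1) -> float:
    """Average instructions completed per clock cycle under the same assumptions."""
    cycles = pipeline_cycles(n_instructions, n_stages, issue_width)
    return n_instructions / cycles if cycles else 0.0


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(n, throughput(n, n_stages=4), throughput(n, n_stages=4, issue_width=2))
```

Under these assumptions, four instructions on a four-stage pipeline finish in 7 cycles. As the instruction count grows, single-issue throughput approaches one instruction per cycle, and with an issue width of two it approaches two per cycle. Real programs fall short of these limits because of the hazards discussed in this chapter.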



8.1 Consider the following sequence of instructions

Add #20,R0,R1
Mul #3,R2,R3
And #$3A,R2,R4
Add R0,R2,R5

In all instructions, the destination operand is given last. Initially, registers R0 and R2
contain 2000 and 50, respectively. These instructions are executed in a computer that has
a four-stage pipeline similar to that shown in Figure 8.2. Assume that the first instruction
is fetched in clock cycle 1, and that instruction fetch requires only one clock cycle.

(a) Draw a diagram similar to Figure 8.2a. Describe the operation being performed by
each pipeline stage during each of clock cycles 1 through 4.
(b) Give the contents of the interstage buffers, B1, B2, and B3, during clock cycles 2

8.2 Repeat Problem 8.1 for the following program:

Add #20,R0,R1 
Mul #3,R2,R3 
And #$3A,R1,R4 
Add R0,R2,R5 

8.3 Instruction I2 in Figure 8.6 is delayed because it depends on the results of I1. By
occupying the Decode stage, instruction I2 blocks I3, which, in turn, blocks I4. Assuming
that I3 and I4 do not depend on either I1 or I2 and that the register file allows two Write
steps to proceed in parallel, how would you use additional storage buffers to make it
possible for I3 and I4 to proceed earlier than in Figure 8.6? Redraw the figure, showing
the new order of steps.

8.4 The delay bubble in Figure 8.6 arises because instruction I2 is delayed in the Decode
stage. As a result, instructions I3 and I4 are delayed even if they do not depend on either
I1 or I2. Assume that the Decode stage allows two Decode steps to proceed in parallel.
Show that the delay bubble can be completely eliminated if the register file also allows
two Write steps to proceed in parallel.

8.5 Figure 8.4 shows an instruction being delayed as a result of a cache miss. Redraw this
figure for the hardware organization of Figure 8.10. Assume that the instruction queue
can hold up to four instructions and that the instruction fetch unit reads two instructions
at a time from the cache.

8.6 A program loop ends with a conditional branch to the beginning of the loop. How would
you implement this loop on a pipelined computer that uses delayed branching with one
delay slot? Under what conditions would you be able to put a useful instruction in the
delay slot?

8.7 The branch instruction of the UltraSPARC II processor has an Annul bit. When set by
the compiler, the instruction in the delay slot is discarded if the branch is not taken. An
alternative choice is to have the instruction discarded if the branch is taken. When is
each of these choices advantageous?

8.8 A computer has one delay slot. The instruction in this slot is always executed, but only
on a speculative basis. If a branch does not take place, the results of that instruction are
discarded. Suggest a way to implement program loops efficiently on this computer.

8.9 Rewrite the sort routine shown in Figure 2.34 for the SPARC processor. Recall that the 
SPARC architecture has one delay slot with an associated Annul bit and uses branch 
prediction. Attempt to fill the delay slots with useful instructions wherever possible.

8.10 Consider a statement of the form 

IF A > B THEN action1 ELSE action2

Write a sequence of assembly language instructions, first using branch instructions
only, then using conditional instructions such as those available on the ARM processor.


